21 research outputs found
SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.
In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation has introduced a modified neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so that a broader range of dysarthric speech can be generated.
To evaluate the effectiveness of these methods for synthesizing training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on dysarthric ASR systems.
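The abstract above names time-feature masking as one augmentation method but gives no details. The following is a minimal sketch, assuming the method resembles SpecAugment-style masking of one band of time frames and one band of feature channels; that resemblance is an assumption, not something the source confirms. The function name `mask_time_and_features`, its parameters, and the toy input are hypothetical.

```python
import random


def mask_time_and_features(spec, max_time_width, max_feat_width, rng, fill=0.0):
    """Return a masked copy of `spec` plus the chosen time and feature bands.

    `spec` is a list of frames, each frame a list of feature values.
    One contiguous band of frames and one contiguous band of feature
    channels are overwritten with `fill`. Band widths and positions are
    drawn uniformly at random (an assumed, SpecAugment-like policy).
    """
    n_frames, n_feats = len(spec), len(spec[0])
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    f_width = rng.randint(0, min(max_feat_width, n_feats))
    f_start = rng.randint(0, n_feats - f_width)

    masked = [list(frame) for frame in spec]  # copy; the input is not modified
    for t in range(n_frames):
        for f in range(n_feats):
            in_time_band = t_start <= t < t_start + t_width
            in_feat_band = f_start <= f < f_start + f_width
            if in_time_band or in_feat_band:
                masked[t][f] = fill
    return masked, (t_start, t_width), (f_start, f_width)


# Toy input: 20 frames of 10 features, all ones, so masked cells are easy to count.
toy_spec = [[1.0] * 10 for _ in range(20)]
masked_spec, time_band, feat_band = mask_time_and_features(
    toy_spec, max_time_width=5, max_feat_width=3, rng=random.Random(0)
)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing recognizers trained on differently augmented data.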
Accurate synthesis of Dysarthric Speech for ASR data augmentation
Dysarthria is a motor speech disorder often characterized by reduced speech
intelligibility through slow, uncoordinated control of speech production
muscles. Automatic speech recognition (ASR) systems can help dysarthric talkers
communicate more effectively. However, robust dysarthria-specific ASR requires
a significant amount of training speech, which is not readily available for
dysarthric talkers. This paper presents a new dysarthric speech synthesis
method for the purpose of ASR training data augmentation. Differences in
prosodic and acoustic characteristics of dysarthric spontaneous speech at
varying severity levels are important components for dysarthric speech
modeling, synthesis, and augmentation. For dysarthric speech synthesis, a
modified neural multi-talker TTS is implemented by adding a dysarthria severity
level coefficient and a pause insertion model to synthesize dysarthric speech
for varying severity levels. To evaluate the effectiveness for synthesis of
training data for ASR, dysarthria-specific speech recognition was used. Results
show that a DNN-HMM model trained on additional synthetic dysarthric speech
achieves WER improvement of 12.2% compared to the baseline, and that the
addition of the severity level and pause insertion controls decreases WER by
6.5%, showing the effectiveness of adding these parameters. Overall results on
the TORGO database demonstrate that using dysarthric synthetic speech to
increase the amount of dysarthric-patterned speech for training has a significant
impact on the dysarthric ASR systems. In addition, we have conducted a
subjective evaluation to assess the dysarthric-ness and similarity of
synthesized speech. Our subjective evaluation shows that the perceived
dysarthric-ness of synthesized speech is similar to that of true dysarthric
speech, especially for higher levels of dysarthria.
Comment: arXiv admin note: text overlap with arXiv:2201.1157
Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss
We introduce a bilingual solution to support English as secondary locale for
most primary locales in hybrid automatic speech recognition (ASR) settings. Our
key developments are: (a) a pronunciation lexicon with grapheme units
instead of phone units, (b) a fully bilingual alignment model and subsequently
bilingual streaming transformer model, (c) a parallel encoder structure with
language identification (LID) loss, (d) parallel encoder with an auxiliary loss
for monolingual projections. We conclude that in comparison to LID loss, our
proposed auxiliary loss is superior in specializing the parallel encoders to
respective monolingual locales, and that contributes to stronger bilingual
learning. We evaluate our work on large-scale training and test tasks for
bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual
models demonstrate strong English code-mixing capability. In particular, the
bilingual IT model improves the word error rate (WER) for a code-mix IT task
from 46.5% to 13.8%, while also achieving a close parity (9.6%) with the
monolingual IT model (9.5%) over IT tests.
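The abstracts in this listing report results as word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names and the example sentences are illustrative, and real evaluations typically normalize text before scoring. The last lines work out the relative reduction implied by the reported change from 46.5% to 13.8%.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# One deleted word out of four reference words gives a WER of 0.25.
example_wer = word_error_rate("turn the lights on", "turn lights on")

# Code-mix IT task from the abstract: 46.5% -> 13.8% WER,
# a relative reduction of about 0.703 (roughly 70%).
code_mix_gain = relative_reduction(46.5, 13.8)
```

Distinguishing absolute from relative WER changes matters when comparing papers: the drop above is 32.7 points in absolute terms and about 70% in relative terms.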
Comparison of <i>SINR</i> calculation for conventional MVDR, PSO-MVDR, GSA-MVDR, SLGSA-MVDR [28] and ECGSA-MVDR for user at 0° and interference at 30°.
Comparison of performance of power response with 100 iterations for user at 0° with two interferences at 30° and 50°.
(a) MVDR, (b) PSO-MVDR, (c) GSA-MVDR, (d) SLGSA-MVDR [28] and (e) ECGSA-MVDR.
An Experience Oriented-Convergence Improved Gravitational Search Algorithm for Minimum Variance Distortionless Response Beamforming Optimum
An experience oriented-convergence improved gravitational search algorithm (ECGSA) based on two new modifications, searching through the best experiments and use of a dynamic gravitational damping coefficient (α), is introduced in this paper. ECGSA saves its best fitness function evaluations and uses those as the agents' positions in the searching process. In this way, the best trajectories found are retained and the search starts from these trajectories, which allows the algorithm to avoid local optima. Also, by means of the proposed dynamic gravitational damping coefficient, the agents can move faster in the search space to obtain better exploration during the first stage of the search and converge rapidly to the optimal solution at the final stage. The performance of ECGSA has been evaluated by applying it to eight standard benchmark functions along with six complicated composite test functions. It is also applied to the adaptive beamforming problem as a practical case, to improve the weight vectors computed by the minimum variance distortionless response (MVDR) beamforming technique. The results of the proposed algorithm are compared with those of some well-known heuristic methods, and they verify the proposed method in terms of both reaching optimal solutions and robustness.
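For context on the baseline that the abstract says ECGSA refines, the sketch below computes conventional MVDR weights with the textbook closed form w = R^{-1} a / (a^H R^{-1} a). It does not implement ECGSA or any of the heuristic variants. The user and interferer angles (0°, 30°, 50°) follow the figure captions in this listing; the 8-element half-wavelength uniform linear array, the power levels, and all function names are assumptions made for illustration.

```python
import cmath
import math


def steering_vector(angle_deg, n_elements, spacing=0.5):
    """Steering vector of a uniform linear array (spacing in wavelengths)."""
    phase = 2 * math.pi * spacing * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase * k) for k in range(n_elements)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mvdr_weights(covariance, steering):
    """Conventional MVDR weights: w = R^{-1} a / (a^H R^{-1} a)."""
    r_inv_a = solve(covariance, steering)
    denom = sum(a.conjugate() * v for a, v in zip(steering, r_inv_a))
    return [v / denom for v in r_inv_a]


def response(weights, steering):
    """Array response w^H a toward the direction described by `steering`."""
    return sum(w.conjugate() * a for w, a in zip(weights, steering))


# Assumed scenario: 8 elements, two strong interferers, weak white noise.
n = 8
a_user = steering_vector(0.0, n)
interferers = [steering_vector(30.0, n), steering_vector(50.0, n)]
noise_power, interferer_power = 0.01, 10.0

# Interference-plus-noise covariance: R = sum_i p * a_i a_i^H + sigma^2 * I.
R = [
    [
        (noise_power if i == j else 0.0)
        + sum(interferer_power * a[i] * a[j].conjugate() for a in interferers)
        for j in range(n)
    ]
    for i in range(n)
]

w = mvdr_weights(R, a_user)
user_gain = abs(response(w, a_user))
interferer_gains = [abs(response(w, a)) for a in interferers]
```

In this setup `user_gain` stays at unity (the distortionless constraint) while the entries of `interferer_gains` are driven close to zero, which is the null-steering behaviour the power-response and SINR comparisons in the captions evaluate.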
The simplified flowchart of ECGSA beamforming.
Mass acceleration toward the result force in GSA [20].
Comparison of weight vectors for conventional MVDR, PSO-MVDR, GSA-MVDR, SLGSA-MVDR [28] and ECGSA-MVDR for user at 0° and interferences at 30° and 50°.